Abstract

We present multidimensional membership mixture (M^3) models, in which every dimension of the
membership represents an independent mixture model and each data point is generated jointly from
the selected mixture components. This is helpful when the data has a certain shared structure. For
example, three unique means and three unique variances can effectively form a Gaussian mixture
model with nine components, while requiring only six parameters to fully describe it. In this
paper, we present three instantiations of M^3 models (together with the learning and inference
algorithms): infinite, finite, and hybrid, depending on whether the number of mixtures is fixed or
not. They are built upon Dirichlet process mixture models, latent Dirichlet allocation, and a
combination of the two, respectively. We then consider two applications: topic modeling and
learning 3D object arrangements. Our experiments show that our M^3 models achieve better
performance using fewer topics than many classic topic models. We also observe that topics from
the different dimensions of M^3 models are meaningful and orthogonal to each other.



1 Introduction 

Inheriting the mixture model's ability to approximate a complicated distribution using a small set
of simpler distributions, topic models are used to capture 'topics' (distributions over the
vocabulary) shared across different documents [8, 3, 41]. This concept has also been successfully
applied to building image hierarchies [4, 23, 40, 24], where feature-object-scene relationships
follow the word-topic-document analogy.

Most previous models consider that a word is generated from one type of topic (which we call a
single-dimensional membership). However, in some cases, an observation may be generated from two or
more types of topics. As a pedagogical example, let us consider a case of Gaussian mixture
modeling. Fig. 1 shows a dataset generated from 10 different Gaussian distributions, while we only
need five unique means and two unique covariances. Unless we capture this effect (that the data is
generated by two components from different parameter spaces), our model would not be parsimonious
and would thus be susceptible to over-fitting. Such a scenario also happens in document modeling.

In this work, we present multidimensional membership mixture (M^3) models in which every data point
has a multidimensional membership. Each dimension is a draw from an independent mixture model. For
the example above, we can use one mixture model for the means and another for the covariances. As a
result, each point has a 2D membership. This leads to parsimonious representations when the data
has a certain shared structure.

Now let us take topic modeling for the NIPS corpus as an example. Topics that the words are drawn
from are the result of combining common topics (e.g., 'algorithm' and 'result') and
section-specific topics (e.g., 'neural' and 'control').¹ Those topics are orthogonal in the sense
that papers of either neuroscience or control theory would have content about methods and results.
Using M^3 models can not only obviate the need for unnecessary topics, but also identify such
orthogonal structures in the data.

1 For simplicity, we use one keyword to represent a topic here.

Figure 1: A mixture of ten Gaussians, with five unique means and two unique covariance matrices.

We can also view this multidimensional membership as a way of sharing parameters — observations 
assigned to different mixture components in one dimension may be assigned to the same component 
in another dimension and thus effectively share the parameters. This is different from hierarchical 
topic models, such as hierarchical DP ifTllRTIRTl . Although they allow sharing among components 
branching from the same ancestors, the total number of components needed to model the data is not 
reduced accordingly. 

Similar to other topic/mixture models, M^3 models need to infer latent membership for each data
point. But the coupling of multiple mixture models through observed data makes inference more
challenging. In this paper, we formulate and derive three instantiations of M^3 models: infinite
M^3, finite M^3 and hybrid M^3, according to whether the number of mixture components is fixed or
not. They are built upon multiple Dirichlet process mixture models (DPMMs), multiple latent
Dirichlet allocation (LDA) models, and a combined DPMM and finite mixture model.

In the experiments, we first present a proof-of-concept example of applying the infinite M^3 model
to the task of Gaussian mixture modeling. We then consider the task of document modeling. We
compare our finite M^3 model with different topic models (including LDA, hierarchical DP, etc.) on
several corpora. Extensive experiments show that our model achieves lower perplexity on the
held-out documents while having fewer topics than the baselines. The topics extracted from the
different dimensions also reflect certain orthogonality.

Finally, we also consider the problem of learning 3D object arrangements (which is useful in scene
understanding, object recognition and robotic applications). An object being at a particular place
is governed by two orthogonal factors: its affordances (i.e., how it is used by humans, such as
drinking or touching) [29] and potential human poses (e.g., sitting in a chair or browsing a
bookshelf) [14, 10]. Therefore we use two independent mixture models and enable objects of
different usages to be associated with one human pose and vice versa. Results show that our hybrid
M^3 model outperforms both finite and infinite mixture models whose mixture components are defined
on the joint space of human poses and object affordances.

2 Related Work 

There is a huge body of work employing mixture models. Here we only name a few in the area of
probabilistic topic models. More recent developments in topic modeling can be found in [1, 39].

Most methods used to extract topics from a document corpus are grounded in latent variable models
and statistical decomposition techniques, such as the mixture of unigrams model [33] and the
probabilistic latent semantic indexing model (pLSI) [16]. Later, Blei et al. [3] proposed LDA to
model the hierarchy of a corpus, allowing different documents to share similar topic proportions
while words from one document are sampled from the same topic distribution. It was later extended
to nonparametric hierarchical models [11, 41, 47], so that the hierarchy and the number of topics
are learned together. The inferential difficulty in those topic models can be alleviated by using
variational inference [28, 3, 42] or MCMC sampling [31], including collapsed Gibbs sampling [15].
These models consider single-dimensional topics, and therefore are complementary to our method.

Many models relax the assumptions in LDA by modeling word non-exchangeability [44, 13], or by
modeling the correlations among topics [2, 20, 34]. These ideas are complementary to ours, and
similar techniques may be applied to M^3 models. Further, there has also been work on incorporating
other meta-data such as authors [36, 6], citations [30], and tags [7]. Our M^3 model does not
require such meta-data. More importantly, none of these extensions considers the factorization of
topics into multiple mixture models.

Topic models have been widely applied to computer vision applications such as building image
hierarchies [23], object detection [40], activity recognition [46], and classification, annotation
and segmentation [24]. In some applications, the model is augmented with spatial information to
yield spatially coherent topics [45]. However, none of the models presented in these works
considers the generation of data points as a multidimensional mixture of topics.

There are previous works in matrix factorization [9], factored models [35] and parameter sharing
ETI [17, 22, 27, 32], where a lower dimensional representation of the parameters is used. Even
though these approaches are quite different from our M^3 models, they are relevant to our work
since the M^3 model also uses a compact "factored" representation for the parameters.

Our model does not diverge far from multidimensional clustering [5], two-way groupings [37] [TBI
and some biclustering models [25]. However, in this work we are interested in modeling the
posterior density and topics instead of clustering.

Many collaborative filtering methods are also built upon mixture models [19, 26], where user
preferences are often modeled by different mixture components. This is similar to classic topic
models whose membership is single-dimensional. The flexible mixture model [38], however, considers
2D membership (user and object group) for an observed rating score. It is close to our work, but we
additionally consider (Dirichlet) priors and an infinite number of groups.

3 Multidimensional Membership Mixture (M^3) Models

We present the general idea of our approach in this section, and then describe three specific instan- 
tiations in the next section. 

A mixture model typically consists of $K$ mixture components, each of which is a distribution
parameterized by $\theta_k$, denoted as $F(\theta_k)$. Drawing a data point $x$ involves choosing a
mixture component $z \in \{1, \ldots, K\}$ according to the mixing proportions
$\pi = (\pi_1, \ldots, \pi_K)$ (subject to $\sum_{k=1}^{K} \pi_k = 1$), and then drawing from
$F(\theta_z)$, i.e.,

$$ z \mid \pi \sim \pi; \qquad x \mid z, \theta \sim F(\theta_z). \qquad (1) $$

Depending on whether K is fixed or not, mixture models can be categorized as finite and infinite (or 
nonparametric) mixture models. 

Our M^3 models assume data is generated jointly by several independent mixture models. In
particular, an $L$-dimensional M^3 model consists of $L$ different mixture models, each having
$K^\ell$ components parameterized by $\theta^\ell$. Now, generating a data point $x$ involves
choosing a mixture component $z^\ell \in \{1, \ldots, K^\ell\}$ for each of the $L$ dimensions.
Given $\theta = (\theta^\ell)_{\ell=1}^{L}$ and $z = (z^1, \ldots, z^L)$, we then draw $x$ from the
distribution parameterized by the selected $L$ mixture components together:

$$ z^\ell \mid \pi^\ell \sim \pi^\ell, \ \forall \ell = 1, \ldots, L; \qquad x \mid z, \theta \sim F(\theta^1_{z^1}, \ldots, \theta^L_{z^L}). \qquad (2) $$

Note that the domain of the density function F is now a Cartesian product of the domains of L 
mixture models, which may or may not be the same. 
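The generative process in Eq. 2 can be illustrated with a minimal sketch. The names `sample_m3_point` and `gaussian`, the particular means, standard deviations and uniform mixing proportions, and the use of one-dimensional data are all assumptions made for illustration; they are not taken from the experiments in this paper.

```python
import random

def sample_m3_point(mixing, params, draw, rng):
    """Draw one point from an L-dimensional M^3 model (illustrative sketch).

    mixing[l] -- mixing proportions pi^l of dimension l
    params[l] -- list of component parameters theta^l of dimension l
    draw      -- function (selected_params, rng) -> x, playing the role of F
    """
    # One membership per dimension: z^l | pi^l ~ pi^l
    z = [rng.choices(range(len(pi)), weights=pi)[0] for pi in mixing]
    # The L selected components jointly parameterize the distribution F
    selected = [theta[zl] for theta, zl in zip(params, z)]
    return z, draw(selected, rng)

def gaussian(selected, rng):
    # F is a 1-D normal whose mean comes from dimension 1 and whose
    # standard deviation comes from dimension 2.
    mean, std = selected
    return rng.gauss(mean, std)

rng = random.Random(0)
means = [-4.0, -2.0, 0.0, 2.0, 4.0]   # dimension 1: K^1 = 5 (hypothetical values)
stds = [0.1, 1.0]                     # dimension 2: K^2 = 2 (hypothetical values)
mixing = [[0.2] * 5, [0.5, 0.5]]
samples = [sample_m3_point(mixing, [means, stds], gaussian, rng) for _ in range(1000)]
memberships = {tuple(z) for z, _ in samples}
```

With five means and two standard deviations, the observed memberships cover ten distinct (mean, deviation) pairs even though only seven parameters are stored.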

M^3 models are related to standard mixture models in the following ways. When $L = 1$, an M^3
model degenerates to the standard mixture model. When $L > 1$, we can also cast it into an
equivalent (single-dimensional) mixture model by defining a new mixture component for every
combination of $L$ components as $\theta'_k = (\theta^1_{j_1}, \ldots, \theta^L_{j_L})$, where
$j_\ell \in \{1, \ldots, K^\ell\}$. This leads to a total of $\prod_{\ell=1}^{L} K^\ell$ mixture
components. When $L$ or $K^\ell$ is large, the corresponding mixture model would be prohibitive to
compute and may tend to over-fit the data. On the other hand, M^3 models only construct
$\sum_{\ell=1}^{L} K^\ell$ mixture components. While this is much more parsimonious, our method
relies on the assumption that the data is drawn from shared mixture components whose parameters are
generated from independent processes. Our model would also be able to obtain better estimates of
the parameters, because each shared component is effectively estimated from more observations.
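As a quick worked check of the counting argument, the sketch below compares the number of components an M^3 model stores (the sum of the per-dimension sizes) with the number of components of the equivalent single-dimensional mixture (their product). The function name `component_counts` is hypothetical.

```python
from math import prod

def component_counts(sizes):
    """Return (stored, effective) component counts for per-dimension sizes K^1..K^L."""
    stored = sum(sizes)      # components an M^3 model actually parameterizes
    effective = prod(sizes)  # components of the equivalent flat mixture model
    return stored, effective

# Three means and three variances, as in the abstract's example.
stored, effective = component_counts([3, 3])
```

For the example in Fig. 1 (five means, two covariances), the same function gives 7 stored versus 10 effective components.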

Note that an $L$-dimensional M^3 model is not the same as $L$ independent mixture models, as they
are linked through the observations. This coupling results in challenges in the inference, such as
when optimizing parameters of the $L$ mixture models jointly or sampling from their joint posterior
distribution. In the following sections, we will show how to derive and use some specific M^3
models.
4 Formulation and Inference for Three Instantiations of M^3 Models

In this section, we describe our three specific instantiations of the M^3 model. Each is a
combination of $L = 2$ mixture models, and we informally call them 2-D M^3 models.² Experiments on
each of these models will be presented in Section 5.

4.1 Infinite M^3 Models using Dirichlet Processes

When the number of components, $K$, is unknown, nonparametric Bayesian methods are often used.
For example, the Dirichlet process mixture model (DPMM), which is also referred to as the infinite
mixture model, can adapt $K$ to the data automatically (an overview of DPs can be found in [43]).
A DPMM can be constructed using a stick-breaking process:

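$$ \pi \sim \mathrm{GEM}(1, \alpha); \quad \theta_k \sim G_0; \quad z_i \mid \pi \sim \pi; \quad x_i \mid z_i, \theta_{1:\infty} \sim F(\theta_{z_i}), \qquad (3) $$

where $G_0$ is the base distribution of $\theta$ and $\alpha$ is the concentration parameter. The
Chinese restaurant process provides another perspective for understanding how $z_i$ is selected:

$$ P(z_i = z \mid z^{-i}) =
\begin{cases}
\dfrac{n_z^{-i}}{N - 1 + \alpha} & \text{if } z \text{ is previously used}, \\[4pt]
\dfrac{\alpha}{N - 1 + \alpha} & \text{otherwise},
\end{cases} \qquad (4) $$

where the superscript $-i$ denotes everything except the $i$-th instance and $n_z^{-i}$ equals the
number of data points assigned to the component $z$, excluding $x_i$.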
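The Chinese restaurant process conditional of Eq. 4 is simple enough to compute directly. The sketch below is illustrative only: the function name `crp_conditional` and the use of the string key "new" for a previously unused component are assumptions of this example.

```python
from collections import Counter

def crp_conditional(z_others, alpha):
    """Distribution of z_i given the assignments of the other N - 1 points (Eq. 4)."""
    n_minus_1 = len(z_others)            # N - 1
    counts = Counter(z_others)           # n_z^{-i} for every used component z
    denom = n_minus_1 + alpha
    probs = {z: c / denom for z, c in counts.items()}
    probs["new"] = alpha / denom         # probability of opening a new component
    return probs

probs = crp_conditional([0, 0, 1, 2, 2, 2], alpha=1.0)
```

Because the counts of the used components sum to $N - 1$, the returned probabilities sum to one by construction.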

We formulate the 2-D infinite M^3 model as a combination of two DPMMs, as shown in Fig. 2. Each DP
mixture model follows the same stick-breaking process as in Eq. 3, except that $x_i$ is now sampled
as

$$ x_i \mid z_i^1, z_i^2, \theta^1_{1:\infty}, \theta^2_{1:\infty} \sim F(\theta^1_{z_i^1}, \theta^2_{z_i^2}). \qquad (5) $$

We now define the conditional distribution for $z_i^1, z_i^2$, the counterpart of Eq. 4 in our M^3
model. Let $n_{cd}^{-i}$ equal the number of observations (excluding $x_i$) with $z_j^1 = c$ and
$z_j^2 = d$, and let $n_{c\cdot}^{-i} = \sum_d n_{cd}^{-i}$ and
$n_{\cdot d}^{-i} = \sum_c n_{cd}^{-i}$. If we assume $z_i^1$ and $z_i^2$ are independent, the joint
conditional decomposes into their individual conditionals, each of the same form as Eq. 4 with
$n_z^{-i}$ replaced by $n_{c\cdot}^{-i}$ and $n_{\cdot d}^{-i}$, respectively. However, this is a
strong assumption that does not hold in general. Therefore, we introduce a sharing parameter
$\omega \in [0, 1]$ to control the correlation between the two. We define the joint conditional as
follows:³



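$$ P(z_i^1 = c, z_i^2 = d \mid z^{1,-i}, z^{2,-i}) =
\begin{cases}
\dfrac{(1 - \omega)\, n_{c\cdot}^{-i}\, n_{\cdot d}^{-i} + \omega\, n_{cd}^{-i} (N - 1)}{(N - 1)(N - 1 + \alpha)} & n_{c\cdot}^{-i} > 0 \text{ and } n_{\cdot d}^{-i} > 0, \\[6pt]
\dfrac{\omega^1 \alpha\, n_{c\cdot}^{-i}}{(N - 1)(N - 1 + \alpha)} & n_{c\cdot}^{-i} > 0,\ n_{\cdot d}^{-i} = 0, \\[6pt]
\dfrac{\omega^2 \alpha\, n_{\cdot d}^{-i}}{(N - 1)(N - 1 + \alpha)} & n_{c\cdot}^{-i} = 0,\ n_{\cdot d}^{-i} > 0, \\[6pt]
\dfrac{\omega \alpha}{N - 1 + \alpha} & \text{otherwise},
\end{cases} \qquad (6) $$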

with $\omega^1$ and $\omega^2$ (subject to $\omega + \omega^1 + \omega^2 = 1$) used to tune the
relative concentration between the two dimensions. Under this definition, $\omega$ constructs a
smooth continuum between 2-D M^3 models and 1-D M^3 models (same as DPMM): when $\omega = 1$,
$z_i^1$ and $z_i^2$ will always be equal and hence the model boils down to a single DPMM; when
$\omega = 0$, the equation above decomposes into two distributions similar to Eq. 4 for $z_i^1$ and
$z_i^2$, respectively.
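To make the role of the sharing parameter concrete, the following sketch evaluates a four-case joint conditional and checks that it is a proper distribution. It assumes this form: weight $((1-\omega)\, n_{c\cdot} n_{\cdot d} + \omega\, n_{cd}(N-1)) / ((N-1)(N-1+\alpha))$ when both components are already used, $\omega^1 \alpha\, n_{c\cdot} / ((N-1)(N-1+\alpha))$ and $\omega^2 \alpha\, n_{\cdot d} / ((N-1)(N-1+\alpha))$ when exactly one of them is new, and $\omega \alpha / (N-1+\alpha)$ when both are new. The function name `joint_conditional` and the "new" key are hypothetical.

```python
from collections import Counter

def joint_conditional(pairs, alpha, omega, omega1, omega2):
    """Joint conditional of (z_i^1, z_i^2) given the other N - 1 membership pairs."""
    assert abs(omega + omega1 + omega2 - 1.0) < 1e-12
    n = len(pairs)                          # N - 1
    n_cd = Counter(pairs)                   # joint counts
    n_c = Counter(c for c, _ in pairs)      # counts in dimension 1
    n_d = Counter(d for _, d in pairs)      # counts in dimension 2
    denom = n * (n + alpha)
    probs = {}
    for c in n_c:                           # both components already used
        for d in n_d:
            probs[(c, d)] = ((1 - omega) * n_c[c] * n_d[d] + omega * n_cd[(c, d)] * n) / denom
    for c in n_c:                           # new component in dimension 2 only
        probs[(c, "new")] = omega1 * alpha * n_c[c] / denom
    for d in n_d:                           # new component in dimension 1 only
        probs[("new", d)] = omega2 * alpha * n_d[d] / denom
    probs[("new", "new")] = omega * alpha / (n + alpha)   # new in both dimensions
    return probs

pairs = [(0, 0), (0, 1), (1, 1)]
mixed = joint_conditional(pairs, alpha=1.0, omega=0.5, omega1=0.25, omega2=0.25)
tied = joint_conditional(pairs, alpha=1.0, omega=1.0, omega1=0.0, omega2=0.0)
```

Under these assumptions the four groups contribute $(N-1)/(N-1+\alpha)$, $\omega^1\alpha/(N-1+\alpha)$, $\omega^2\alpha/(N-1+\alpha)$ and $\omega\alpha/(N-1+\alpha)$ respectively, so the total is one exactly when $\omega + \omega^1 + \omega^2 = 1$. With $\omega = 1$, pairs that never co-occurred receive zero probability, matching the single-DPMM limit described above.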

We can use Gibbs sampling with auxiliary parameters [31] to sample $z^1, z^2, \theta^1, \theta^2$
from their posterior distributions:

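1. For $i = 1, \ldots, n$: sample $z_i^1$ and $z_i^2$ according to Eq. 6 multiplied by $f(x_i \mid \theta^1_{z_i^1}, \theta^2_{z_i^2})$;

2. For $\ell = 1, 2$: sample $\theta^\ell_k$ with the probability given by $G^\ell(\theta^\ell_k) \prod_{i: z_i^\ell = k} f(x_i \mid \theta^1_{z_i^1}, \theta^2_{z_i^2})$.

Here $f(\cdot \mid \theta^1, \theta^2)$ is the density function of the distribution $F(\theta^1, \theta^2)$.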
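The first step of the sampler multiplies a prior conditional over candidate memberships by the likelihood of the observation and draws from the normalized result. The sketch below shows only that multiply, normalize and draw pattern; it is not the full auxiliary-parameter algorithm. The function name `sample_posterior_membership` is hypothetical, and the `likelihood` callable is assumed to already know how to evaluate candidates that involve a newly created component.

```python
import random

def sample_posterior_membership(prior, likelihood, rng):
    """Sample a membership with probability proportional to prior * likelihood.

    prior      -- dict mapping each candidate membership to its prior probability
    likelihood -- function mapping a candidate to f(x_i | selected parameters)
    """
    candidates = list(prior)
    weights = [prior[k] * likelihood(k) for k in candidates]
    total = sum(weights)
    posterior = {k: w / total for k, w in zip(candidates, weights)}
    # random.choices accepts unnormalized weights.
    choice = rng.choices(candidates, weights=weights)[0]
    return choice, posterior

rng = random.Random(1)
prior = {("a", "x"): 0.5, ("b", "y"): 0.5}
# Toy likelihood: the observation is impossible under the first candidate.
choice, posterior = sample_posterior_membership(
    prior, lambda k: 0.0 if k == ("a", "x") else 1.0, rng)
```

In a full implementation the parameters attached to new components would be drawn from the base distributions before the likelihood is evaluated; that part is omitted here.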

Implementations of the second step differ depending on which $F$ and $G^{1:2}$ are used. For
example, finding conjugate priors for exponential family distributions is easy for M^3, but it is
not so when $G^{1:2}$ are both Dirichlet distributions, because they are no longer conjugate priors
for a multinomial distribution $F$. In such cases, other sampling-based methods such as
Metropolis-Hastings [31] or Gibbs sampling [12, 41] could be used, depending on the distribution.
In this paper, we will present a concrete example with $F$ as a normal distribution and $G$ as its
conjugate prior in Section 5.1.

2 Although we only present 2-D cases, generalization to $L > 2$ is straightforward.

3 Note that the superscript in the symbols $z$, $\omega$, etc. denotes the dimension ($\ell = 1, \ldots, L$), not the exponent.
4.2 Finite M^3 Models for Topic Modeling

Latent Dirichlet allocation (LDA) employs a hierarchical finite mixture model to describe the
generative process of a document: first, a topic proportion $\pi$ over $K$ topics is drawn from a
symmetric Dirichlet distribution with prior $\alpha$; then a topic $z_i$ is chosen for each word
$x_i$, and $x_i$ is drawn from $\theta_{z_i}$, a multinomial distribution over the whole vocabulary
of size $V$. Thus, LDA allows words from the same document to share similar topic distributions
while documents share a finite set of topics. However, in many real-world datasets, words could be
generated from several different types of topics. In this section, we model each type of topic as a
dimension in our M^3 model.

We define our 2-D finite M^3 model as follows (shown in Fig. 2): we assume two independent topic
spaces, and a word is drawn from a topic synthesized by two topics, one from each topic space, with
the probability

$$ p(x_i = v \mid z_i^1 = j, z_i^2 = k, \theta^{1:2}) = \tfrac{1}{2} \big[ (1 + \omega)\, \theta^1_{jv} + (1 - \omega)\, \theta^2_{kv} \big], \qquad (7) $$

where $\omega \in [0, 1]$ tunes the weight of the two topic models in forming a new topic. It serves
a similar purpose to the $\omega$ in Eq. 6: the 2-D finite M^3 model degenerates to classic LDA when
$\omega = 1$.
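The effect of the weight can be seen on a toy vocabulary. The sketch below assumes that the synthesized word distribution is the normalized combination $\tfrac{1}{2}[(1+\omega)\theta^1_{j} + (1-\omega)\theta^2_{k}]$, the form suggested by the term $(1+\omega)\theta^1 + (1-\omega)\theta^2$ that appears in the parameter-estimation discussion later in this section. The function name `synthesized_topic` and the example topics are hypothetical.

```python
def synthesized_topic(theta1, theta2, omega):
    """Combine one topic from each topic space into a single word distribution.

    theta1, theta2 -- word distributions over the same vocabulary
    omega          -- weight in [0, 1]; omega = 1 keeps only theta1
    """
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    return [((1 + omega) * a + (1 - omega) * b) / 2 for a, b in zip(theta1, theta2)]

section_topic = [0.7, 0.2, 0.1]   # hypothetical section-specific topic
common_topic = [0.1, 0.1, 0.8]    # hypothetical common topic
even_mix = synthesized_topic(section_topic, common_topic, omega=0.0)
lda_limit = synthesized_topic(section_topic, common_topic, omega=1.0)
```

Since the two weights $(1+\omega)/2$ and $(1-\omega)/2$ sum to one, the result is again a distribution over the vocabulary; at $\omega = 1$ it reduces to the first topic alone, which is the LDA limit mentioned above.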

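As with other topic models, the goal of applying the finite M^3 model to corpora is density
estimation, i.e., to maximize the likelihood of the test documents. After integrating out
$\pi^{1:2}$ and $z^{1:2}$, we obtain the likelihood of a document
$\mathbf{w} = (x_1, \ldots, x_N)$ conditioned on the model as

$$ p(\mathbf{w} \mid \alpha^{1:2}, \theta^{1:2}) \propto \int\!\!\int \prod_{k=1}^{K^1} (\pi^1_k)^{\alpha^1 - 1} \prod_{k=1}^{K^2} (\pi^2_k)^{\alpha^2 - 1} \prod_{i=1}^{N} \sum_{z_i^1} \sum_{z_i^2} \pi^1_{z_i^1} \pi^2_{z_i^2} \big[ (1 + \omega)\, \theta^1_{z_i^1 x_i} + (1 - \omega)\, \theta^2_{z_i^2 x_i} \big] \, d\pi^1 \, d\pi^2. \qquad (8) $$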



This distribution is intractable to compute in general. We therefore approximate it using
variational inference similar to that used in LDA [3].

Variational Inference. Following the classic LDA method, we use the variational distribution

$$ q(\pi^{1:2}, z^{1:2} \mid \gamma^{1:2}, \phi^{1:2}) = q(\pi^1 \mid \gamma^1)\, q(\pi^2 \mid \gamma^2) \prod_{i=1}^{N} q(z_i^1 \mid \phi_i^1)\, q(z_i^2 \mid \phi_i^2) $$

as an approximation to the true posterior distribution
$p(\pi^{1:2}, z^{1:2} \mid \mathbf{w}, \alpha^{1:2}, \theta^{1:2})$. The difference between the two
is quantified by the KL divergence:

$$ D(q \,\|\, p) = \log p(\mathbf{w} \mid \alpha^{1:2}, \theta^{1:2}) - \mathcal{L}(\gamma^{1:2}, \phi^{1:2}; \alpha^{1:2}, \theta^{1:2}). \qquad (9) $$

Since the KL divergence is always non-negative, $\mathcal{L}$ above is a lower bound of
$\log p(\mathbf{w} \mid \alpha^{1:2}, \theta^{1:2})$. Therefore, our goal is to maximize
$\mathcal{L}$ so that the likelihood $p(\mathbf{w} \mid \alpha^{1:2}, \theta^{1:2})$ is large as
well. During inference, the goal is to optimize $\mathcal{L}$ with respect to $\phi^{1:2}$ and
$\gamma^{1:2}$ for each document. This is similar to LDA and thus we provide details only in the
supplementary material. During training, given $D$ documents, our goal is to find the model's
parameters that maximize $\mathcal{L}$. We solve it by iteratively inferring
$(\phi_d^{1:2}, \gamma_d^{1:2})$ for each document and estimating $\alpha^{1:2}$, $\theta^{1:2}$
and $\omega$ given the rest. The step of parameter estimation is more challenging than in classic
LDA due to the entanglement of the two topics and the sharing parameter $\omega$.

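Parameter Estimation. When the variational distribution is fixed, the terms involving
$\alpha^{1:2}$ in $\mathcal{L}$ are as follows. Here, $\Gamma(\cdot)$ is the Gamma function, and
$\Psi(\cdot)$ is the digamma function.

$$ \mathcal{L}_{\alpha^{1:2}} = \sum_{\ell=1}^{2} \Big( \log \Gamma(K^\ell \alpha^\ell) - K^\ell \log \Gamma(\alpha^\ell) + (\alpha^\ell - 1) \sum_{i=1}^{K^\ell} \big( \Psi(\gamma_i^\ell) - \Psi\big(\textstyle\sum_{j=1}^{K^\ell} \gamma_j^\ell\big) \big) \Big). \qquad (10) $$

Since $\alpha^1$ and $\alpha^2$ are independent of each other and of $\omega$ and $\theta^{1:2}$,
we can update them separately, similar to LDA. However, the terms involving $\theta^{1:2}$ and
$\omega$ in $\mathcal{L}$ are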

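$$ \mathcal{L}_{\theta^{1:2}, \omega} = \sum_{d=1}^{M} \sum_{n=1}^{N_d} \sum_{i=1}^{K^1} \sum_{j=1}^{K^2} \phi^1_{dni}\, \phi^2_{dnj} \log \big[ (1 + \omega)\, \theta^1_{i v_{dn}} + (1 - \omega)\, \theta^2_{j v_{dn}} \big]. \qquad (11) $$

Any derivative of this would have terms containing the reciprocal of
$(1 + \omega)\, \theta^1_{i v_{dn}} + (1 - \omega)\, \theta^2_{j v_{dn}}$ in the innermost
summation, making it hard to obtain closed-form expressions. We instead convert the problem into an
unconstrained problem by defining the new objective function:

$$ \min_{\theta^1, \theta^2, \omega} \; -\mathcal{L}_{\theta^{1:2}, \omega} + \frac{1}{2} \sum_{i=1}^{K^1} \lambda_i \Big( \sum_{j=1}^{V} \theta^1_{ij} - 1 \Big)^2 + \frac{1}{2} \sum_{i=1}^{K^2} \eta_i \Big( \sum_{j=1}^{V} \theta^2_{ij} - 1 \Big)^2, \qquad (12) $$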
Figure 3: Given data generated from a mixture of Gaussian distributions (left), the density
estimation obtained by the standard DPMM (middle) and our infinite M^3 model (right). The bottom
row shows that even when there are no shared parameters, our model performs as well as the DPMM.

Figure 4: Confusion matrices for Dataset-0 obtained by the DPMM (left) and the infinite M^3 model
(right). The intensity of each pixel represents the percentage of times two flowers are grouped
into one cluster. The ground truth is a 3 x 3 block-diagonal matrix.



where $\{\lambda_i, \eta_i\}$ impose a positive penalty for violating the constraints. Each of
$\{\lambda_i, \eta_i\}$ is initialized with a small value and gradually increased when the
corresponding constraint is violated. Given the weights, the optimal $\theta^{1:2}$ and $\omega$
are computed by the limited-memory BFGS algorithm, a standard quasi-Newton method. In practice, the
penalties are only updated a few times before the optimization converges.

4.3 Hybrid M^3 Models

There are many cases when one mixture model has a fixed number of components while the other one
does not. For example, in the task of scene understanding, we can model object type and its pose
separately. While objects can appear in countless poses, it is reasonable to assume a finite set of
object categories. Therefore, we form our hybrid M^3 model as a combination of a DPMM and a finite
mixture model. We update the assignments $z_i^1$ and $z_i^2$ in turn. We use the standard DP to
sample $z_i^1$ given $z_i^2$, and use maximum likelihood estimation to update $z_i^2$ given a set
of sampled $z_i^1$.

5 Applications and Experimental Results 

In this section, we first illustrate how our M^3 models behave on the task of Gaussian mixture
modeling. Then we evaluate our finite M^3 model on the task of document modeling, on four different
datasets and against three baselines. Finally, we apply our hybrid M^3 model to the task of
estimating object arrangements in human environments.

5.1 Gaussian Mixture Model 

In the classic Gaussian mixture model, data points are drawn from a set of different Gaussian
distributions, i.e., $x_i \mid z_i, (\mu_1, \Sigma_1), \ldots, (\mu_K, \Sigma_K) \sim \mathcal{N}(\mu_{z_i}, \Sigma_{z_i})$,
whereas our 2-D M^3 model uses two mixture models for means and covariances respectively, and draws
data from

$$ x_i \mid z_i^1, z_i^2, \mu_1, \ldots, \mu_{K^1}, \Sigma_1, \ldots, \Sigma_{K^2} \sim \mathcal{N}(\mu_{z_i^1}, \Sigma_{z_i^2}). $$

We use conjugate priors for $\mu$ and $\Sigma$ (Gaussian and inverse Wishart distributions,
respectively), with the same hyperparameters in both algorithms.

We created a synthetic dataset for evaluating the results in terms of density estimation. For our
model, the density is $x \sim \sum_c \sum_d p(c, d \mid z^1, z^2)\, \mathcal{N}(\mu_c, \Sigma_d)$,
averaged over 1000 samples. From the contours shown in Fig. 3, we can see that our method
successfully identifies the correct clusters in both the sharing and non-sharing cases. The averaged
normalized mutual information (NMI) for our model is 0.75 and 0.96, compared to 0.66 and 0.97 for
the DPMM.

We also tested it on the Iris dataset (Dataset-0), containing 150 flowers from three species with
four features for each sample.⁴ The confusion matrices in Fig. 4 show that while both methods can
correctly find the first species, the DPMM is confused about the last two species. The NMI of the
M^3 model is 0.72 versus 0.67 for the DPMM.
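For readers unfamiliar with the metric, the sketch below computes NMI between two labelings of the same points. The paper does not state which normalization it uses; dividing the mutual information by the geometric mean of the two entropies is an assumption of this example, as is the function name `nmi`.

```python
from collections import Counter
from math import log, sqrt

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same points."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    # Mutual information from the empirical joint and marginal frequencies.
    mi = sum(c / n * log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in count_ab.items())
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    if h_a == 0 or h_b == 0:
        # A labeling with a single cluster carries no information.
        return 1.0 if h_a == h_b else 0.0
    return mi / sqrt(h_a * h_b)

same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The score is invariant to renaming clusters, which is why it suits comparisons between inferred memberships and ground-truth species labels.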

5.2 Document Topic Modeling 

We test our finite M^3 model (tagged 'FM3' in the figures) on four document corpora. Dataset-1
contains processed NIPS 1-12 proceedings with 1447 papers organized into 9 sections and 5270 words
after removing words that appeared more than 4000 times or fewer than 50 times.⁵ Dataset-2 includes
1000 randomly selected documents from the 20 newsgroups with a total of 1498 words after removing
stop-words and words in fewer than 5 documents.⁶ Dataset-3 selects 1000 encyclopedia articles with
1200 words.⁷ Dataset-4 takes 500 articles from Psychological Review with 1244 words.⁸ All the
following results are based on 5 runs or 5-fold cross validation. Experiments on the first two
datasets are in the same setup as in [41] and [47], respectively.

4 http://archive.ics.uci.edu/ml/datasets/Iris
5 http://www.gatsby.ucl.ac.uk/~ywteh/research/data.html

Figure 5: Results on Dataset-1. Perplexity of mixing VS and the other 8 sections (left) and the
average perplexity when changing the number of training documents from VS (right). The error bars
are one standard error.

To investigate how well our model can learn general topics and section-specific topics, we train on
80 articles from the VS (vision science) section and 80 articles from one of the other 8 sections.
We test on the remaining 47 VS papers. We use the perplexity [3] of the held-out documents to
evaluate the learned topic model:
$\mathrm{perplexity}(\mathbf{w}_1, \ldots, \mathbf{w}_D) = \exp\big( -\sum_{d=1}^{D} \log p(\mathbf{w}_d) \,/\, \sum_{d=1}^{D} N_d \big)$.
A lower perplexity indicates higher likelihood of the test data and thus better performance.
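The perplexity formula can be sanity-checked on a model whose answer is known in advance. In the sketch below, the function name `perplexity` and the toy uniform model are assumptions for illustration; the per-document log-likelihoods would in practice come from the trained topic model.

```python
from math import exp, log

def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(sum_d log p(w_d)) / (sum_d N_d)) over held-out documents."""
    return exp(-sum(doc_log_likelihoods) / sum(doc_lengths))

# Toy check: a model that assigns every word probability 1/V.
vocab_size = 100
lengths = [30, 50, 20]
log_liks = [n * log(1.0 / vocab_size) for n in lengths]
uniform_perplexity = perplexity(log_liks, lengths)
```

A model that spreads probability uniformly over $V$ words has perplexity $V$, so any useful topic model on a vocabulary of that size should score below it.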

Fig. 5-left shows the perplexity obtained by LDA, HDP and our method. In the comparison with LDA,
we set LDA's topic number $K$ equal to the total number of topics of the M^3 model, $K^1 + K^2$, so
that the two models have the same number of parameters. We see that our method performs
significantly better than LDA across all eight sections for both 12 and 25 topics. This is because
the M^3 model effectively has more topics than LDA. Such trends hold for different values of $K$,
$K^1$ and $K^2$.

In the presented results, we set the second dimension of our M^3 model to have only a few topics
($K^2 = 2$ or 5). This forces documents from different sections to 'share' them. Our method and HDP
both outperform LDA, showing that the ability to have shared topics is helpful. However, HDP does
so in a hierarchy, so that a sub-tree shares similar topic proportions; it does not reduce the
number of topics needed to model the data. In fact, the number of topics used in HDP is around 55,
far more than our 12 topics.

In another experiment on Dataset-1, we vary the number of training documents from VS up to 80, but
always test on the remaining 47 VS documents. When the number is small, the domains of the training
and test datasets differ, so this setting can be used to test the transfer of topic learning.
Fig. 5-right shows the perplexity, averaged over all sections, with respect to the number of VS
training documents. We can see that the performance of LDA largely depends on the number of VS
papers, while the change in the perplexity of HDP and our model is less significant. Our finite M^3
method not only beats all the baselines but also gives the most consistent results in all cases.
This demonstrates that 1) our model can learn the common topics of two different sections, and 2)
it is less sensitive to having a small training set, since the multidimensional membership
effectively allows more documents to be used for estimating the parameters.

We also test on the other three datasets; the results are shown in Fig. 6. Compared to the
baselines, our M^3 model obtains the lowest perplexity and demonstrates its robustness in different
scenarios.

In order to explore what orthogonal topics our M^3 model discovered, we list one topic from each
dimension in Fig. 7. Topics in the first row are quite different from each other and contain
keywords for specific sections, such as 'digits' for SP (speech and signal processing), while the
topics in the bottom row mostly consist of popular words in NIPS such as 'work' and 'algorithms'.
This indeed reflects that M^3 represents topics parsimoniously.



6 http://people.csail.mit.edu/jrennie/20Newsgroups/
7 http://www.cs.nyu.edu/~roweis/data.html
8 http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

Figure 6: Results on Dataset-2 to Dataset-4, obtained by FTM (reported in [47]), LDA, HDP and our
finite M^3 model. For LDA and FM3, we only report the best performance among different numbers of
topics.

Figure 7: Topics of VS combined with three other sections found by our finite M^3 model: the top
seven words (ranked by weight) of one topic from each dimension ($K^1 = 10$, $K^2 = 2$) are listed.
Topics from the first dimension are section-specific and different from each other, while the
second dimension contains popular terms in NIPS and the topics in it do not change much.



In M^3 models, the choice of $K^1$ and $K^2$ affects the performance. Similar to LDA, the optimal
value for the number of topics ($K^1 + K^2$) varies with the size and heterogeneity of the corpus,
and one may need to try different values. The ratio $K^1 / K^2$ is interesting: in most datasets we
found that an asymmetric setting performs better, e.g., the result of setting $K^1 = K^2 = 10$ is
worse than $K^1 = 20$, $K^2 = 5$.



5.3 Object Arrangement 

We finally considered the task of learning object arrangements in human scenes, using the dataset
in [18]. It contains 3D models for 20 scenes such as living rooms, kitchens and offices, and 19
different object categories. Each room was manually labeled with arrangements of multiple objects.
We considered two learning scenarios: placing new objects that are not in the test room and placing
objects in an empty room. We performed 5-fold cross validation on 20 rooms and evaluated the
predicted arrangements based on the location difference and height difference from the labels. We
compared our hybrid M^3 model (with one DPMM for human poses and one finite mixture model for
object affordances) against both a finite mixture model and a single DPMM. Results are shown in
Table 1. Our method not only predicts arrangements closer to the ground truth but also places
relevant but different types of objects together, because it allows objects with different
affordances to share the same human pose.

Table 1: Results of learning object arrangements evaluated by the difference in location and
height (in meters).

                    location  height   location  height
    Finite mixture  1.59      0.16     1.74      0.20
    DP              1.65      0.11     2.01      0.28
    M^3             1.44      0.09     1.63      0.11

6 Conclusion 

In this paper, we presented multidimensional membership mixture (M^3) models, which consist of
multiple independent mixture models. Each data point is generated jointly from a set of mixture
components designated by its multidimensional membership. We derived three instantiations of M^3
models: infinite, finite and hybrid M^3. The infinite M^3 model uses multiple Dirichlet processes
as the prior of memberships, while the finite M^3 model is built upon two LDAs. In both models, we
introduced a tunable sharing parameter to increase robustness in both sharing and no-sharing
situations. The challenge in inference is addressed by Gibbs sampling and variational inference. We
applied M^3 models to topic modeling. Compared to the baselines, our model achieved better
performance with fewer topics and learned orthogonal topics. We also verified our model in the
application of learning object arrangements.